Abstract:

We present a method for incorporating the information contained in new datasets into
an existing set of parton distribution functions without the need for refitting. The method
involves reweighting the ensemble of parton densities through the computation of the $\chi^2$
to the new dataset. We explain how reweighting may be used to assess the impact of
any new data or pseudodata on parton densities and thus on their predictions. We show 
that the method works by considering the addition of inclusive jet data to a DIS+DY 
fit, and comparing to the refitted distribution. We then use reweighting to determine 
the impact of recent high-statistics lepton asymmetry data from the D0 experiment on
the NNPDF2.0 parton set. We find that the D0 inclusive muon and electron data are
perfectly compatible with the rest of the data included in the NNPDF2.0 analysis and
impose additional constraints on the large-x d/u ratio. The more exclusive D0 electron
datasets are however inconsistent both with the other datasets and among themselves, 
suggesting that here the experimental uncertainties have been underestimated. 



1 Introduction 



The determination of parton distribution functions (PDFs) and their uncertainties through 
global fits to datasets taken in deep inelastic and hadronic collision experiments, analysed 
using perturbative QCD, is one of the key ingredients in the exploitation of future
experiments, in particular at the LHC. Of course such fits can only be as good as the data that go
into them, so whenever there are new data or new experiments, the fits have to be redone
to take the new data into account. This process is cumbersome and time consuming, and 
can only be performed using the same software as in the previous fits, and thus by the 
fitting collaborations themselves. 

In this paper we will show how, by using the ensembles of PDFs produced by the 
NNPDF collaboration [1-4], anyone can determine the effect of new data on the PDFs
quickly and easily: all that is required is a computation of the $\chi^2$ to the new data for each
PDF in the ensemble [5]. With this information one can determine the overall impact of
the new data on PDFs, their consistency with the older data used in the fit, the effect 
the new data have on the shape and precision of individual PDFs, and thus their effect 
on observables such as benchmark cross-sections or predictions for new physics. The same 
approach can be used just as easily to estimate the effects of pseudodata from proposed 
experiments or machines on PDFs and thus on cross-sections. 

The technique we use is based on statistical inference. In the NNPDF approach [1-4]
we generate through a Monte Carlo procedure an ensemble of $N$ PDF replicas,
$\mathcal{E} = \{f_k, k = 1, \ldots, N\}$, each fitted to a data replica generated according to the uncertainties
and their correlations as measured by the experimental collaborations. Each PDF is
parametrized by a highly redundant neural network in order to avoid parametrization bias
which would otherwise spoil the procedure, and the stopping point of the fit of each replica
is determined using cross-validation to prevent over-fitting. The final PDF ensemble then
forms an accurate representation of the probability distribution of PDFs^1 conditional on
the input data and the particular assumptions (such as NLO QCD, a value of $\alpha_s$, etc.)
used in the fits.

Given an NNPDF ensemble one can evaluate any quantity or experimental observable
$\mathcal{O}[f]$ depending on the PDFs (such as the PDF mean, the variance, PDF correlations, or
indeed the mean, variance etc. of any cross-section computed from them) by computing
$\mathcal{O}[f]$ for each of the replicas, and averaging the results. This is because the integral in the
space of functions is well approximated by the average over the ensemble $\mathcal{E}$, so that the
mean value of $\mathcal{O}[f]$ is given by

$$\langle\mathcal{O}\rangle = \int \mathcal{O}[f]\,\mathcal{P}(f)\,Df = \frac{1}{N}\sum_{k=1}^{N}\mathcal{O}[f_k]. \tag{1}$$

Each of the replicas $f_k$ carries equal weight because they were generated through
importance sampling: the replicas were fitted to a data replica generated according to the
probability distribution of the experimental data, using a fitting procedure with no bias.

We can include the effects of a new independent dataset without performing a new fit
if we instead reweight the old fit according to weights $w_k$, which assess the probability
that each PDF replica agrees with the new data. The reweighted ensemble then forms
a representation of the probability distribution of PDFs conditional on both the old and
the new data. The weights are computed straightforwardly by evaluating the $\chi^2$ of the
new data to each of the replicas. The mean value of the observable $\mathcal{O}[f]$ taking account
of the new data is then given by the weighted average

$$\langle\mathcal{O}\rangle_{\rm new} = \frac{1}{N}\sum_{k=1}^{N} w_k\,\mathcal{O}[f_k]. \tag{2}$$

^1 Throughout this paper we will denote 'parton distribution function' by PDF, but write out 'probability
density function' in full, in order to avoid any confusion: both are probability densities, but in very different
spaces.



The usefulness of this method is clear: it becomes possible to test the impact of a 
new dataset, or indeed the potential impact of MC data generated for a new experiment, 
quickly and simply without the need for a new fit (or indeed without considering explicitly 
any other datasets except the one under immediate consideration). This comes at a price: 
the effective number of replicas will be reduced, either because the new data prove to be 
very constraining (good), or because they are inconsistent with the old data (bad). We 
will provide a criterion to distinguish between these two cases. Of course if the new data 
are both consistent and precise, the effective number of replicas might be so reduced that 
a refit becomes necessary. 

One of the advantages of the reweighting method is that it can be used to assess the 
impact on the global fit of observables for which no fast code is available, and thus which 
cannot be included without resorting to K-factor approximations. One such observable is
the Tevatron W lepton charge asymmetry. Recent measurements [6-10] of this quantity
have attracted a lot of attention, since sizable tension with other data in the global fit
sensitive to the large-x d quark distribution, such as DIS deuterium structure function
data, has been reported [11,12]. It is not clear from these studies whether this tension
reflects an experimental problem of the recent Tevatron data, or whether the problems
are with the DIS deuterium data, perhaps indicating the need for substantial nuclear
corrections. With this motivation, armed with the statistical power of reweighting, we
will here study the incorporation of the D0 W lepton charge asymmetry data in the
NNPDF2.0 fit. 

Reweighting is also important from a conceptual point of view. If more and more 
data are included in the fit through reweighting, the resulting PDFs become less and 
less dependent on the initial PDF. But PDFs obtained in this way then by construction 
satisfy the laws of statistical inference — for example, uncertainties will automatically 
behave upon inclusion of new data according to standard statistics.^2 Hence, checking that
the results obtained by reweighting coincide with results obtained by simply including the 
new data in the global fit provides a highly nontrivial check on the consistency of the 
NNPDF global fitting procedure. 

The paper is organised as follows. In the next section, we will explain how the weights
can be computed, and give tests through which one can assess quantitatively the impact
of the new data and their consistency with the old data. A detailed proof of these results,
with a full discussion of the subtleties, is given in section 3: this is important because an
earlier attempt to implement a reweighting procedure [5] used an expression for the weights
which differs from our result (a detailed examination of the result of Ref. [5] is presented
in section 3.2).^3 This section may be skipped by readers only interested in applying the
new technique. In section 4 we show how the method may be used in practice by applying
it to inclusive jet data: since there are existing NNPDF sets with and without these data,
this allows us to check that reweighting works. Then in section 5 we illustrate the power
of reweighting by using it to assess the impact of D0 W lepton charge asymmetry data on
the NNPDF2.0 PDFs. The results are particularly interesting because, while the inclusive
D0 asymmetry data are perfectly compatible with the NNPDF2.0 set and their inclusion
in the global fit results in a moderate improvement in the determination of the medium
and large-x d quark PDF, the more exclusive electron datasets turn out to be inconsistent
both with other sets in the global analysis and among themselves.

^2 Indeed, it was suggested in Ref. [13] that a PDF fit might be performed by including all data through
reweighting of a first guess based on past experience.

^3 A recent study by the LHCb collaboration [14], using a reweighting technique to assess the impact
of low $p_T$ Drell-Yan pairs at the LHC on PDF determinations, also appears to use the incorrect formula
derived in [5].



2 Reweighting 



2.1 The Weights 

We consider the situation where a set of experimental data have been used to construct
a probability distribution for PDFs, $\mathcal{P}_{\rm old}(f)$. This probability distribution is delivered
as a finite ensemble of PDFs $\mathcal{E} = \{f_k, k = 1, \ldots, N\}$. Any observable can be obtained
by performing averages over this ensemble as prescribed in Eq. (1), that is, equally weighting
each PDF.

The problem we shall now address is how to update this probability distribution when
new experimental data are available. There are two options: either we can construct a
new probability distribution $\mathcal{P}_{\rm new}(f)$ from scratch by including both old and new data in
a new fit, or we can update the old fit by computing a weight, $w_k$, for each individual
PDF $f_k$ in the ensemble $\mathcal{E}$ according to the rules of statistical inference. Then $\mathcal{P}_{\rm new}(f)$
is simply understood as an update (or reweighting) of the prior probability distribution
$\mathcal{P}_{\rm old}(f)$.

Both methods incorporate the same information, the old and the new data, and we will
show below that when the weights are chosen correctly both techniques are statistically
equivalent, in the sense that when the number of replicas is sufficiently large they both
give representations of the same probability distribution $\mathcal{P}_{\rm new}(f)$. However, calculating
the weights involves only knowledge of the new data: all the relevant information about
the old data is already contained in $\mathcal{P}_{\rm old}(f)$. It is thus substantially easier to implement,
since no refitting is necessary.

To be specific, we consider a set of $n$ new data that have not been included in the
determination of the initial probability density distribution:

$$y = \{y_1, y_2, \ldots, y_n\}.$$

Clearly any instance of such a set of data can be seen as a point $y$ in an $n$-dimensional
real space. The experimental uncertainties are summarised by the $n \times n$ experimental
covariance matrix $\sigma_{ij}$, which includes a term that incorporates the overall normalization
uncertainty [15], but reduces to a diagonal matrix in cases when experimentalists do not
provide the correlated systematic uncertainties. We assume throughout that these new
data are statistically independent of any of the data included in the original fit.

Using statistical inference we can update the initial probability density $\mathcal{P}_{\rm old}(f)$ by
taking into account the new data, thereby obtaining an improved probability density
$\mathcal{P}_{\rm new}(f)$. To do this we need to know the relative probabilities of the new data for different
choices of PDF. Since the new data are assumed to have Gaussian errors, these probabilities
will be directly proportional to the probability density of the $\chi$ to the new data conditional
on $f$:

$$\mathcal{P}(\chi|f) \propto (\chi^2(y,f))^{\frac{1}{2}(n-1)}\, e^{-\frac{1}{2}\chi^2(y,f)}, \tag{3}$$

where, if $y_i[f]$ is the value predicted for the data point $y_i$ using the PDF $f$,

$$\chi^2(y,f) = \sum_{i,j=1}^{n} (y_i - y_i[f])\,\sigma_{ij}^{-1}\,(y_j - y_j[f]). \tag{4}$$
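As a minimal sketch of how the $\chi^2$ of Eq. (4) might be evaluated numerically, one can solve a linear system with the covariance matrix rather than inverting it explicitly. The function name `chi2` and the toy numbers below are illustrative assumptions and are not taken from any NNPDF code.

```python
import numpy as np

def chi2(y, y_pred, cov):
    """chi^2 of Eq. (4): (y - y[f])^T cov^{-1} (y - y[f])."""
    r = np.asarray(y, dtype=float) - np.asarray(y_pred, dtype=float)
    # Solve cov @ x = r instead of forming the inverse matrix explicitly.
    x = np.linalg.solve(np.asarray(cov, dtype=float), r)
    return float(r @ x)

# Toy numbers (hypothetical): three data points with correlated errors.
y_data = [1.0, 2.0, 3.0]
y_theory = [1.1, 1.8, 3.3]
cov = np.array([[0.04, 0.01, 0.00],
                [0.01, 0.09, 0.02],
                [0.00, 0.02, 0.16]])

chi2_corr = chi2(y_data, y_theory, cov)
# With a diagonal covariance matrix the double sum reduces to sum(r_i^2 / sigma_i^2).
chi2_diag = chi2(y_data, y_theory, np.diag(np.diag(cov)))
```

The diagonal case corresponds to the situation mentioned above in which no correlated systematic uncertainties are provided.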

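It follows from the statistical independence of the old and new data that, by the law of
multiplication of probabilities,

$$\mathcal{P}_{\rm new}(f) = \mathcal{N}_\chi\,\mathcal{P}(\chi|f)\,\mathcal{P}_{\rm old}(f), \tag{5}$$

where $\mathcal{N}_\chi$ is a normalization factor, independent of $f$.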

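Multiplying on both sides by some observable $\mathcal{O}[f]$ and integrating over the PDFs,

$$\begin{aligned}
\langle\mathcal{O}\rangle_{\rm new} &= \int \mathcal{O}[f]\,\mathcal{P}_{\rm new}(f)\,Df \\
&= \mathcal{N}_\chi \int \mathcal{O}[f]\,\mathcal{P}(\chi|f)\,\mathcal{P}_{\rm old}(f)\,Df \\
&= \frac{1}{N}\sum_{k=1}^{N} \mathcal{N}_\chi\,\mathcal{P}(\chi|f_k)\,\mathcal{O}[f_k],
\end{aligned} \tag{6}$$

where in the last line we used Eq. (1).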

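We can thus sample the probability density $\mathcal{P}_{\rm new}(f)$ using the $N$ replicas $f_k$, but
reweighted: in place of Eq. (1) we now have

$$\langle\mathcal{O}\rangle_{\rm new} = \frac{1}{N}\sum_{k=1}^{N} w_k\,\mathcal{O}[f_k], \tag{7}$$

where

$$w_k = \mathcal{N}_\chi\,\mathcal{P}(\chi|f_k) \propto (\chi_k^2)^{\frac{1}{2}(n-1)}\, e^{-\frac{1}{2}\chi_k^2}, \tag{8}$$

and $\chi_k^2 = \chi^2(y, f_k)$ is evaluated using Eq. (4). The normalization factor $\mathcal{N}_\chi$ is fixed by
normalizing the new probability density $\mathcal{P}_{\rm new}(f)$: taking the operator $\mathcal{O}[f]$ to be the unit
operator, $\langle 1\rangle_{\rm new} = 1$, so from Eq. (7) this fixes $\sum_{k=1}^{N} w_k = N$, and thus using Eq. (8)

$$w_k = \frac{(\chi_k^2)^{\frac{1}{2}(n-1)}\, e^{-\frac{1}{2}\chi_k^2}}{\frac{1}{N}\sum_{j=1}^{N}(\chi_j^2)^{\frac{1}{2}(n-1)}\, e^{-\frac{1}{2}\chi_j^2}}. \tag{9}$$

The weights $w_k$, when divided by $N$, are then simply the probabilities of the replicas $f_k$,
given the $\chi^2$ to the new data.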
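The reweighted average of Eq. (7) can be sketched in a few lines. The helper below is a hypothetical illustration (its name and the toy inputs are assumptions): it takes the values of an observable on each replica and a set of weights normalised so that they sum to $N$, and returns the weighted mean together with the weighted standard deviation obtained by applying the same rule to the squared deviation from the mean.

```python
import numpy as np

def reweighted_mean_std(obs, weights):
    """Weighted mean and standard deviation of an observable over replicas.

    obs[k] is O[f_k]; weights[k] is w_k, normalised so that sum(w) = N.
    """
    obs = np.asarray(obs, dtype=float)
    w = np.asarray(weights, dtype=float)
    n_rep = len(obs)
    mean = np.sum(w * obs) / n_rep                 # Eq. (7)
    var = np.sum(w * (obs - mean) ** 2) / n_rep    # Eq. (7) applied to (O - <O>)^2
    return mean, np.sqrt(var)

# Equal weights reproduce the plain ensemble average of Eq. (1).
mean_old, std_old = reweighted_mean_std([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0, 1.0])
# Toy weights (hypothetical) that discard the last two replicas.
mean_new, std_new = reweighted_mean_std([1.0, 2.0, 3.0, 4.0], [2.0, 2.0, 0.0, 0.0])
```

With all weights equal to one the function reduces to the unweighted average, which is a convenient sanity check.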

Note that our formula Eq. (9) for the weights is different from the one derived in
Ref. [5] whenever the number of new data points is greater than one. The reason for
this is that the use of Bayes' theorem for multidimensional probability densities is subtle,
since without care one may fall foul of the Borel-Kolmogorov paradox (see for example
Ref. [16]). A careful proof of the weights Eq. (9) using the elementary rules of statistical
inference is given in sec. 3.1; the subtle error in the argument used in Ref. [5] is explained
in sec. 3.2.
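Two immediate consequences of the weight formula Eq. (9) can be worked out as a check. They are derived here directly from that formula and are not quoted from the proofs in section 3.

```latex
% n = 1: the power of chi^2 drops out and the weights are purely Gaussian in chi_k
w_k\Big|_{n=1} = \frac{e^{-\frac{1}{2}\chi_k^2}}
                      {\frac{1}{N}\sum_{j=1}^{N} e^{-\frac{1}{2}\chi_j^2}}

% general n: the weight is largest where its logarithm is stationary
\frac{d}{d\chi_k^2}\left[\tfrac{1}{2}(n-1)\ln\chi_k^2 - \tfrac{1}{2}\chi_k^2\right]
  = \frac{n-1}{2\chi_k^2} - \frac{1}{2} = 0
\quad\Longrightarrow\quad \chi_k^2 = n-1
```

For a single data point the extra power of $\chi^2$ is therefore absent, while for $n > 1$ the replicas carrying most weight are those with $\chi^2$ per data point close to one, not those with the smallest $\chi^2$.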



2.2 Measuring Information Loss and Consistency 

The original ensemble of replicas $\mathcal{E} = \{f_k, k = 1, \ldots, N\}$ is constructed through
importance sampling of the probability density $\mathcal{P}_{\rm old}(f)$, and thus each replica has equal weight.
It is maximally efficient, in the sense that this is the best representation of the underlying
density $\mathcal{P}_{\rm old}(f)$ for a given number of replicas $N$: the only way to improve it is by
increasing $N$. After reweighting, this will no longer be the case, since the weights give
the relative importance of the different replicas, and the replicas with very small weights
will become almost irrelevant in ensemble averages. The reweighted replicas will thus no
longer be as efficient as the old: for a given $N$, the accuracy of the representation of the
underlying distribution $\mathcal{P}_{\rm new}(f)$ will be less than it would be in a new fit.



We can quantify this loss of efficiency by using the Shannon entropy to compute the
effective number of replicas left after reweighting:

$$N_{\rm eff} = \exp\left\{\frac{1}{N}\sum_{k=1}^{N} w_k \ln(N/w_k)\right\}. \tag{10}$$

Clearly $0 < N_{\rm eff} < N$: the reweighted fit has the same accuracy as a refit with $N_{\rm eff}$ replicas.
Thus if $N_{\rm eff}$ becomes too low, the reweighting procedure will no longer be reliable, either
because the new data contain a lot of information on the PDFs, necessitating a full refitting
with more replicas, or because the new data are inconsistent with the old.
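The effective number of replicas of Eq. (10) is straightforward to compute from the weights. The sketch below is an illustration under the assumption that the weights are normalised to sum to $N$; the function name is hypothetical, and replicas with zero weight are skipped because $w\ln(N/w) \to 0$ as $w \to 0$.

```python
import numpy as np

def effective_replicas(weights):
    """Shannon-entropy effective number of replicas, Eq. (10)."""
    w = np.asarray(weights, dtype=float)
    n_rep = len(w)
    nonzero = w > 0                      # w * log(N / w) -> 0 as w -> 0
    entropy = np.sum(w[nonzero] * np.log(n_rep / w[nonzero])) / n_rep
    return float(np.exp(entropy))

n_eff_uniform = effective_replicas([1.0, 1.0, 1.0, 1.0])   # no information lost
n_eff_half = effective_replicas([2.0, 2.0, 0.0, 0.0])      # half the replicas survive
n_eff_single = effective_replicas([4.0, 0.0, 0.0, 0.0])    # one replica dominates
```

In these toy cases the result equals the number of replicas that share the weight equally, which matches the interpretation of $N_{\rm eff}$ given above.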

These two cases can be distinguished by examining the $\chi^2$ profile of the new data: if
in the reweighted fit there are very few replicas with a $\chi^2$ per data point of order unity,
the errors in the new dataset have probably been underestimated. This profile may be
easily evaluated:

$$\mathcal{P}(\chi^2) = \frac{1}{N}\sum_{k} w_k, \tag{11}$$

where the sum is over all replicas $k$ such that $\chi_k^2 \in [\chi^2, \chi^2 + d\chi^2]$.
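In practice the profile of Eq. (11) is a weighted histogram. The following sketch bins the $\chi^2$ per data point and sums $w_k/N$ in each bin; the function name, the binning and the toy numbers are assumptions made only for illustration.

```python
import numpy as np

def chi2_profile(chi2_k, weights, n_dat, bins=20, value_range=(0.0, 4.0)):
    """Weighted histogram of chi^2 per data point, in the spirit of Eq. (11)."""
    chi2_k = np.asarray(chi2_k, dtype=float)
    w = np.asarray(weights, dtype=float)
    n_rep = len(chi2_k)
    # Each bin collects (1/N) * sum of w_k over the replicas falling in it.
    hist, edges = np.histogram(chi2_k / n_dat, bins=bins, range=value_range,
                               weights=w / n_rep)
    return hist, edges

# Toy input (hypothetical): four replicas, 20 data points, weights summing to N = 4.
hist, edges = chi2_profile([10.0, 20.0, 30.0, 40.0], [1.0, 2.0, 1.0, 0.0],
                           n_dat=20, bins=4)
```

Setting all weights to one gives the profile before reweighting, so the same helper can be used to compare the two distributions.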

Alternatively, we consider inconsistent data as data whose errors have been
underestimated. If we rescale the uncertainties of the data by a factor $\alpha$, we can use inverse
probability to calculate the probability density for the rescaling parameter $\alpha$:

$$\mathcal{P}(\alpha) \propto \frac{1}{\alpha}\sum_{k=1}^{N} w_k(\alpha). \tag{12}$$

Here $w_k(\alpha)$ are the weights Eq. (8) evaluated by replacing $\chi_k^2$ with $\chi_k^2/\alpha^2$ (and are thus
proportional to the probability of $f_k$ given the new data with rescaled errors): averaging
them in the reweighted fit thus gives the probability density for $\alpha$. If this probability
density peaks close to one the new data are consistent, while if it peaks far above one,
then it is likely that the errors in the data have been underestimated.
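A rough numerical sketch of Eq. (12) is given below. It assumes that $w_k(\alpha)$ is the unnormalised expression of Eq. (8) with $\chi_k^2 \to \chi_k^2/\alpha^2$, with a proportionality constant that does not depend on $\alpha$; this reading, the function name and the grid are assumptions of the example. The density is normalised on the chosen grid with a simple trapezoidal rule.

```python
import numpy as np

def alpha_density(chi2_k, n_dat, alphas):
    """P(alpha) on a grid, following Eq. (12) with unnormalised w_k(alpha)."""
    chi2_k = np.asarray(chi2_k, dtype=float)[:, None]    # shape (N_rep, 1)
    alphas = np.asarray(alphas, dtype=float)             # shape (N_alpha,)
    scaled = chi2_k / alphas[None, :] ** 2               # chi_k^2 / alpha^2
    log_w = 0.5 * (n_dat - 1) * np.log(scaled) - 0.5 * scaled
    log_w -= log_w.max()                                 # common shift for stability
    p = np.exp(log_w).sum(axis=0) / alphas               # (1/alpha) * sum_k w_k(alpha)
    norm = np.sum(0.5 * (p[1:] + p[:-1]) * np.diff(alphas))
    return p / norm

n_dat = 50
alphas = np.linspace(0.5, 4.0, 351)
# Toy case (hypothetical): every replica has chi^2 = 4 * n_dat, i.e. errors too small by 2.
p_bad = alpha_density([4.0 * n_dat] * 5, n_dat, alphas)
# Toy case (hypothetical): every replica has chi^2 = n_dat, i.e. consistent data.
p_good = alpha_density([1.0 * n_dat] * 5, n_dat, alphas)
alpha_bad = alphas[np.argmax(p_bad)]
alpha_good = alphas[np.argmax(p_good)]
```

In the first toy case the density peaks near $\alpha = 2$, signalling underestimated errors, while in the second it peaks near one.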

If the new data are reasonably consistent, we can assess whether they should be
included in a new fit by calculating the $\chi^2$ distribution of the dataset that would be used in
the new fit (i.e. all the old data plus the new data), using the reweighting procedure as in
Eq. (11). Comparison to the old $\chi^2$ distribution then tells us whether the new data would
improve the fit: if so the peak should shift a little towards one, with a slight narrowing due
to the increase in the total number of data points. This may be quantified by comparing
the area under the curves in a given range.



3 Statistical Inference 



3.1 Proof of the Weight Formula 

Here we give a careful derivation of the rules for reweighting presented in the previous
section. Some readers may prefer to skip this section and simply use the practical
prescription as given in Eq. (9). Our argument is based on the standard use of statistical
inference. However some of the details are subtle, since we need to use probability densities
in multi-dimensional spaces, and thus need to take care with limits.

By the probability $P(f)$ for the PDF $f$ what we actually mean is the probability
$P(f|K)$, where $K$ denotes all the data used in the determination and their associated
errors, the values of parameters such as $\alpha_s$ and heavy quark masses used in the computation
of the data expected from the PDF, and finally also the theoretical framework used (for
example NLO QCD). If we then wish to extend the dataset by including new data $y$, the
new probability $P_{\rm new}(f)$ is then $P(f|yK)$: besides $K$ we now also assume the new data $y$.

The new probability is then determined from the old probability using the sampling
distribution $P(y|fK)$ and the multiplicative rule for probabilities (often known as Bayes'
theorem):

$$P(AB|C) = P(A|BC)\,P(B|C) = P(B|AC)\,P(A|C), \tag{13}$$

where $P(A|C)$ is the probability of $A$ given $C$, etc., and $AB$ denotes $A$ and $B$. Naively
applying this result in the present case we have

$$P(f|yK)\,P(y|K) = P(y|fK)\,P(f|K), \tag{14}$$

whence (replacing $P(f|K)$ with $\mathcal{P}(f|K)Df$, and $P(f|yK)$ with $\mathcal{P}(f|yK)Df$)

$$\mathcal{P}(f|yK) = P(y|fK)\,\mathcal{P}(f|K)/P(y|K). \tag{15}$$

Note that $P(y|K)$ does not depend on the PDF $f$, and can thus be determined simply by
insisting that $\mathcal{P}(f|yK)$ is properly normalized: we then find

$$P(y|K) = \int P(y|fK)\,\mathcal{P}(f|K)\,Df, \tag{16}$$

so

$$\mathcal{P}(f|yK) = P(y|fK)\,\mathcal{P}(f|K) \Big/ \int P(y|fK)\,\mathcal{P}(f|K)\,Df, \tag{17}$$

where everything on the right hand side is now known.
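For data that take only discrete values, Eqs. (16) and (17) can be applied to a replica ensemble literally. The sketch below is a toy illustration, assuming a uniform prior $1/N$ over the replicas and a hypothetical list of likelihood values $P(y|f_kK)$; neither the function name nor the numbers come from the original analysis.

```python
import numpy as np

def posterior_over_replicas(likelihoods):
    """P(f_k | y K) from P(y | f_k K), with a uniform prior P(f_k | K) = 1/N."""
    like = np.asarray(likelihoods, dtype=float)
    prior = np.full(len(like), 1.0 / len(like))
    evidence = np.sum(like * prior)       # discrete analogue of Eq. (16)
    return like * prior / evidence        # discrete analogue of Eq. (17)

# Toy likelihoods (hypothetical) of one observed discrete outcome under three replicas.
post = posterior_over_replicas([0.1, 0.3, 0.6])
# If every replica predicts the outcome equally well, nothing is learned.
post_flat = posterior_over_replicas([0.2, 0.2])
```

The posterior probabilities sum to one by construction, which is the normalisation condition used to fix $P(y|K)$.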

This argument would work without problems if the data $y$ could only take discrete
values. The difficulty in the present case is that our data are continuous, so rather than
the probability $P(y|fK)$ we have to work with a multi-dimensional probability density
$\mathcal{P}(y|fK)\,d^ny$, in a limit in which the volume element $d^ny$ goes to zero. Of course in this
limit the probabilities $P(y|fK)$ and $P(y|K)$ also go to zero, and we find a ratio of two
zeros in Eq. (15). The conditional probability $P(f|yK)$ is then only well defined if we
specify carefully the way in which the limit is to be taken: probabilities conditional on
sets of measure zero are ambiguous. Failure to specify the limiting process can result in
contradictions (the Borel-Kolmogorov paradox [16]).

Consider then the probability density for the data $y$. Assuming that the new
experiments are not correlated with any of the experiments used in the determination of the
initial probability density, the probability density of $y$ is then given by Eq. (16):

$$\mathcal{P}(y|K) = \int \mathcal{P}(y|fK)\,\mathcal{P}(f|K)\,Df = \frac{1}{N}\sum_{k=1}^{N}\mathcal{P}(y|f_kK), \tag{18}$$

where in the second step we used Eq. (1). The density $\mathcal{P}(y|fK)$ gives the probability that
the new data lie in an infinitesimal volume $d^ny$ centred at $y$ in the space of possible data
given a particular choice of PDF $f$: it is often called the sampling distribution or (when
considered as a function of $f$) the likelihood function. Assuming that the uncertainties in
the data are purely Gaussian,

$$\mathcal{P}(y|fK)\,d^ny = (2\pi)^{-n/2}(\det\sigma_{ij})^{-1/2}\, e^{-\frac{1}{2}\chi^2(y,f)}\,d^ny, \tag{19}$$

where $\chi^2(y,f)$ is calculated using Eq. (4) (and of course using the assumptions $K$ in
the computation of the predictions $y_i[f]$). The volume element $d^ny$ is independent of $f$:
without a specific prediction, all data are assumed equally likely.

Since to compute $\mathcal{P}(y|fK)$ it is sufficient to compute $\chi^2(y,f)$, it is sufficient for our
purposes to consider the probability density for the $\chi$ to the new dataset:

$$\mathcal{P}(\chi|fK)\,d\chi = 2^{1-n/2}\,(\Gamma(n/2))^{-1}\,(\chi(y,f))^{n-1}\, e^{-\frac{1}{2}\chi^2(y,f)}\,d\chi, \tag{20}$$

where $\chi(y,f) = (\chi^2(y,f))^{1/2}$. This distribution may be readily derived from Eq. (19)
by diagonalising the covariance matrix and rescaling the data to a set $\{Y_i\}$ of independent
Gaussian variables each with unit variance. Then $d^ny = (\det\sigma_{ij})^{1/2}\,d^nY$, and
$\chi^2 = \sum_{i=1}^{n} Y_i^2$. Choosing $n$-dimensional spherical co-ordinates in the space of data (with $\chi$ as the
radial co-ordinate, and thus $y = y[f]$ as the origin), we may write $d^nY = A_n\chi^{n-1}\,d\chi\,d^{n-1}\Omega$,
where $d^{n-1}\Omega$ is the measure on the sphere and $A_n = 2\pi^{n/2}(\Gamma(n/2))^{-1}$ is the area of the
unit sphere in $n$ dimensions. The probability Eq. (19) may thus be written

$$\begin{aligned}
\mathcal{P}(y|fK)\,d^ny &= (2\pi)^{-n/2}\, e^{-\frac{1}{2}\chi^2(y,f)}\,d^nY \\
&= 2^{1-n/2}\,(\Gamma(n/2))^{-1}\,(\chi(y,f))^{n-1}\, e^{-\frac{1}{2}\chi^2(y,f)}\,d\chi\,d^{n-1}\Omega,
\end{aligned} \tag{21}$$

in agreement with Eq. (20) provided

$$\mathcal{P}(y|fK)\,d^ny = \mathcal{P}(\chi|fK)\,d\chi\,d^{n-1}\Omega. \tag{22}$$
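The ingredients of this change of variables are easy to check numerically. The sketch below (function names and grid are assumptions of the example) verifies the unit-sphere area $A_n$ against the familiar values for $n = 2$ and $n = 3$, the identity $(2\pi)^{-n/2}A_n = 2^{1-n/2}/\Gamma(n/2)$ used in passing from the first to the second line of Eq. (21), and the normalisation of the density of Eq. (20).

```python
import math
import numpy as np

def sphere_area(n):
    """Area of the unit sphere in n dimensions, A_n = 2 pi^(n/2) / Gamma(n/2)."""
    return 2.0 * math.pi ** (n / 2) / math.gamma(n / 2)

def chi_density(chi, n):
    """Probability density of chi for n Gaussian data points, Eq. (20)."""
    chi = np.asarray(chi, dtype=float)
    prefactor = 2.0 ** (1 - n / 2) / math.gamma(n / 2)
    return prefactor * chi ** (n - 1) * np.exp(-0.5 * chi ** 2)

def trapezoid(x, values):
    """Plain trapezoidal rule on a one-dimensional grid."""
    return float(np.sum(0.5 * (values[1:] + values[:-1]) * np.diff(x)))

n = 5
chi_grid = np.linspace(0.0, 12.0, 12001)
total_probability = trapezoid(chi_grid, chi_density(chi_grid, n))
lhs = (2.0 * math.pi) ** (-n / 2) * sphere_area(n)
rhs = 2.0 ** (1 - n / 2) / math.gamma(n / 2)
```

The upper integration limit of 12 is an arbitrary cut where the density is already negligible for this value of $n$.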

Again the probability density $\mathcal{P}(\chi|K)$ for the $\chi$ of the new dataset is obtained by averaging
over replicas:

$$\mathcal{P}(\chi|K) = \int \mathcal{P}(\chi|fK)\,\mathcal{P}(f|K)\,Df = \frac{1}{N}\sum_{k=1}^{N}\mathcal{P}(\chi|f_kK), \tag{23}$$

so combining Eq. (18), Eq. (22), and Eq. (23)

$$\mathcal{P}(y|K)\,d^ny = \frac{1}{N}\sum_{k=1}^{N}\mathcal{P}(\chi|f_kK)\,d\chi\,d^{n-1}\Omega = \mathcal{P}(\chi|K)\,d\chi\,d^{n-1}\Omega, \tag{24}$$

since both the volume factor $d^{n-1}\Omega$ and the interval $d\chi$ are independent of the choice of
replica, and may thus be taken out of the sum: this follows directly from the assumption
that the measure $d^ny$ in Eq. (19) is independent of $f$.

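The advantage of using $\mathcal{P}(\chi|fK)$ instead of $\mathcal{P}(y|fK)$ when evaluating Eq. (15) is that
$\mathcal{P}(\chi|fK)$ is only a one dimensional density, so taking the limit in which the volume element
goes to zero is straightforward and unambiguous. We may write Eq. (15) as

$$\mathcal{P}(f|\chi K)Df\;\mathcal{P}(\chi|K)d\chi = \mathcal{P}(\chi|fK)d\chi\;\mathcal{P}(f|K)Df. \tag{25}$$

The marginalization Eq. (23) follows directly on integration over $f$, since if $\mathcal{P}(f|\chi K)$ is
correctly normalized, $\int \mathcal{P}(f|\chi K)Df = 1$. Now, cancelling the $d\chi$ from either side of
Eq. (25) (since this is just a pre-assigned interval),

$$\mathcal{P}(f|\chi K)Df = \frac{\mathcal{P}(\chi|fK)}{\mathcal{P}(\chi|K)}\,\mathcal{P}(f|K)Df. \tag{26}$$

Multiplying on both sides by some observable $\mathcal{O}[f]$ and integrating over the PDFs,

$$\begin{aligned}
\langle\mathcal{O}\rangle_{\rm new} &= \int \mathcal{O}[f]\,\mathcal{P}(f|\chi K)\,Df \\
&= \int \mathcal{O}[f]\,\frac{\mathcal{P}(\chi|fK)}{\mathcal{P}(\chi|K)}\,\mathcal{P}(f|K)\,Df \\
&= \frac{1}{N}\sum_{k=1}^{N}\frac{\mathcal{P}(\chi|f_kK)}{\mathcal{P}(\chi|K)}\,\mathcal{O}[f_k],
\end{aligned} \tag{27}$$

where in the last line we used Eq. (1). This corresponds to the reweighting Eq. (7) with
weights

$$w_k = \frac{\mathcal{P}(\chi|f_kK)}{\mathcal{P}(\chi|K)}. \tag{28}$$

Combining Eq. (28) with Eq. (20) and Eq. (23), we obtain Eq. (9).

Note that a further application of Bayes' theorem to Eq. (28) gives the alternative
form

$$w_k = \frac{P(f_k|\chi K)}{P(f_k|K)} = N\,P(f_k|\chi K), \tag{29}$$

since, because the replicas are uniformly distributed, $P(f_k|K) = 1/N$. Thus $w_k/N$ is the
probability of replica $f_k$ given the $\chi$ to the new data.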



3.2 The Naive Prescription 

It is instructive to also derive the weights working directly with the probability density
$\mathcal{P}(y|fK)$: using Bayes' theorem we may write instead of Eq. (25)

$$\mathcal{P}(f|yK)Df\;\mathcal{P}(y|K)d^ny = \mathcal{P}(y|fK)d^ny\;\mathcal{P}(f|K)Df. \tag{30}$$

Again, the marginalization Eq. (18) follows directly from the requirement that $\mathcal{P}(f|yK)$
be normalized, i.e. that $\int \mathcal{P}(f|yK)Df = 1$.

Naively cancelling the volume factor $d^ny$ from either side, and pursuing the same
argument as before yields:

$$\begin{aligned}
\langle\mathcal{O}\rangle_{\rm new} &= \int \mathcal{O}[f]\,\mathcal{P}(f|yK)\,Df \\
&= \int \mathcal{O}[f]\,\frac{\mathcal{P}(y|fK)}{\mathcal{P}(y|K)}\,\mathcal{P}(f|K)\,Df \\
&= \frac{1}{N}\sum_{k=1}^{N}\frac{\mathcal{P}(y|f_kK)}{\mathcal{P}(y|K)}\,\mathcal{O}[f_k].
\end{aligned} \tag{31}$$

This would then lead to the conclusion of Giele-Keller [5] that the weights are proportional
to $\mathcal{P}(y|f_kK)/\mathcal{P}(y|K)$, and thus (using Eq. (19)) are given by

$$w_k = \frac{e^{-\frac{1}{2}\chi_k^2}}{\frac{1}{N}\sum_{j=1}^{N} e^{-\frac{1}{2}\chi_j^2}}. \tag{32}$$

When $n > 1$ this result is clearly different from our previous result Eq. (28). We see
explicitly that the densities $\mathcal{P}(f|\chi K)$ and $\mathcal{P}(f|yK)$ are not the same, despite the fact that
when the data $y$ take a given value, $\chi$ takes a corresponding value.

It is also clear that the Gaussian weights Eq. (32) must be incorrect: in the limit where
the number of new (and consistent) data $n$ becomes very large, $\chi_k^2$ peaks around $n$, and
only the very few replicas in the tail of the distribution ($\chi_k^2 \ll n$ is very unlikely) will
survive. By contrast, with the correct weights Eq. (28), replicas with $\chi^2 \approx n$ will dominate
the fit, replicas with very small or very large $\chi^2$ being suppressed.
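The different behaviour of the two prescriptions at large $n$ can be seen in a toy comparison. In the sketch below the grid of $\chi_k^2$ values and the value of $n$ are hypothetical, and the weights are compared through their logarithms, which is sufficient to identify the replica that dominates in each case.

```python
import numpy as np

def log_weight_correct(chi2, n):
    """Logarithm of the unnormalised weight of Eq. (9)."""
    return 0.5 * (n - 1) * np.log(chi2) - 0.5 * chi2

def log_weight_naive(chi2):
    """Logarithm of the unnormalised Gaussian weight of Eq. (32)."""
    return -0.5 * chi2

n = 200
# Toy ensemble (hypothetical): chi^2 values spread evenly around n in steps of one.
chi2_k = np.linspace(150.0, 250.0, 101)

best_correct = chi2_k[np.argmax(log_weight_correct(chi2_k, n))]
best_naive = chi2_k[np.argmax(log_weight_naive(chi2_k))]
```

The Gaussian weights always favour the replica with the smallest $\chi^2$, here the one at the lower edge of the grid, whereas the weights of Eq. (9) favour the replica with $\chi^2 = n - 1$.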

The reason for the difference between the results Eq. (28) and Eq. (32) is the
Borel-Kolmogorov paradox [16]: when dealing with multi-dimensional probability densities care
must be taken with limits, since a conditional probability on a set of measure zero is not
well defined. Here the limits used to derive $\mathcal{P}(f|\chi K)$ and $\mathcal{P}(f|yK)$ are different, and thus
so are the results. The correct result can only be obtained by taking the appropriate limit.

The probability density $\mathcal{P}(f|\chi K)$ is defined as the probability density for $f$ given that
the $\chi$ lies in the finite interval $[\chi, \chi + d\chi]$, in the limit $d\chi \to 0$. In this case the conditioning
variable spans a one-dimensional manifold, and therefore there is no freedom in the choice
of the limiting procedure. The definition of $\mathcal{P}(f|\chi K)$ is unique, and thus the argument
which leads to Eq. (28) is unambiguous. However the probability density $\mathcal{P}(f|yK)$ is defined
as the probability density for $f$ given that $y$ lies in some volume $V_n$, in the limit $V_n \to 0$.
In a multi-dimensional space such as this, the conditional probability density $\mathcal{P}(f|yK)$ is
ambiguous, since it depends on the way the volume element $V_n$ is chosen, and then taken
to zero. Different definitions correspond to different physical settings. In the argument
which led us to Eq. (32), we implicitly assumed that $V_n$ was the compact volume $d^ny$
centred on $y$, so that as $V_n \to 0$, the point $y$ was uniquely selected. However there are
many points in the space of data which have the same $\chi^2$, and thus the same effect on
$f$. We must include all these points with equal weight when determining the conditional
probability density of $f$ given $y$, so we need to sum over all the compact volumes $d^ny$ that
build the $(n-1)$-dimensional level surfaces of $\chi(y,f)$ through the point $y$. Thus $V_n$ is
a thin shell with thickness $d\chi$, and hence its total volume is proportional to $A_n\,d\chi$. The
limit $V_n \to 0$ is then taken by letting $d\chi \to 0$. We should thus write Eq. (30) as (using
Eq. (22) and Eq. (24))

$$\mathcal{P}(f|yK)Df\;\mathcal{P}(\chi|K)A_n\,d\chi = \mathcal{P}(\chi|fK)A_n\,d\chi\;\mathcal{P}(f|K)Df. \tag{33}$$

Cancelling the volume factor $A_n\,d\chi$, since this is independent of $f$, this definition is the
same as Eq. (26), and thus yields the correct weights Eq. (9) in the limit $V_n \to 0$.

3.3 Derivation of the Consistency Tests 

Finally we consider the derivation of the two diagnostic results Eq. (11) and Eq. (12). The
first is simply the 'evidence' Eq. (23), evaluated by binning in $\chi^2$. The second is more
involved: using Bayes' theorem

$$\mathcal{P}(\alpha|\chi f_kK)\,d\alpha\;\mathcal{P}(\chi|f_kK)\,d\chi = \mathcal{P}(\chi|\alpha f_kK)\,d\chi\;\mathcal{P}(\alpha|f_kK)\,d\alpha. \tag{34}$$

Now the likelihood $\mathcal{P}(\chi|\alpha f_kK)$ may be evaluated using the usual formula Eq. (20), and
by noting that the effect of $\alpha$ is to rescale $\chi^2 \to \chi^2/\alpha^2$: we accordingly denote the result
by $w_k(\alpha)$. The prior density $\mathcal{P}(\alpha|f_k,K)$ we assume is uniform in $\ln\alpha$, since $\alpha$ is a scale
parameter (this ensures that the results are invariant under $\alpha \to 1/\alpha$). We thus find

$$\mathcal{P}(\alpha|\chi f_kK) = \frac{w_k(\alpha)/\alpha}{\int d\alpha\, w_k(\alpha)/\alpha}, \tag{35}$$

where the overall normalization has been fixed by integrating over $\alpha$. Then as usual

$$\mathcal{P}(\alpha|\chi,K) = \int Df\,\mathcal{P}(\alpha|\chi,f,K)\,\mathcal{P}(f|\chi,K)$$

It is easy to show by a change of variable that the integrals in the denominator are the
same for all $k$, whence we find Eq. (12).



4 Validation: Inclusive Jets 



As a demonstration of the effectiveness of our reweighting procedure, we first apply it 
to a dataset that has already been included and studied in the NNPDF2.0 analysis [3]. 
We thus start with the fit obtained including only the DIS and Drell-Yan data, call this 
NNPDF2.0(DIS+DYP), and then add the inclusive jet data from Tevatron Run II [17,18],
which were included in the NNPDF2.0 analysis, through reweighting. The resulting
reweighted fit can then be compared directly with the NNPDF2.0 fit, which includes
the same DIS, Drell-Yan and Tevatron inclusive jet data. Given the consistency of the
inclusive jet data with the DIS and Drell-Yan data demonstrated in Ref. [3], we expect the
reweighted and refitted distributions to give results that are equivalent up to statistical 
fluctuations. 

Note that from this section on we will slightly change the notation to make it more 
similar to that of previous NNPDF studies: $N_{\rm rep}$ will denote the number of replicas in the
sample and $N_{\rm dat}$ the number of data points in the set which is added through reweighting.

To obtain the reweighted PDFs, all that has to be done is to compute the $\chi_k^2$ of
replica $k$ to the inclusive jet data, using Eq. (4), for each of the $N_{\rm rep} = 1000$ replicas
of the NNPDF2.0(DIS+DYP) parton set. For the inclusive jet data the total number
of data points is $N_{\rm dat} = 186$, and the covariance matrices are as given by the CDF and
D0 collaborations, the normalization uncertainties being incorporated using the $t_0$-method,
as discussed in Ref. [Hdl]. The weight associated with each replica is then computed
according to Eq. (9): specifically we evaluate

$$e_k = \tfrac{1}{2}\left((N_{\rm dat}-1)\log\chi_k^2 - \chi_k^2\right), \tag{37}$$

hence if $\langle e_k\rangle = \frac{1}{N_{\rm rep}}\sum_{k=1}^{N_{\rm rep}} e_k$, the weights are given by

$$w_k = \mathcal{N}\exp[e_k - \langle e_k\rangle], \qquad \mathcal{N} = N_{\rm rep}\Big/\sum_{k=1}^{N_{\rm rep}}\exp[e_k - \langle e_k\rangle]. \tag{38}$$

The subtraction of $\langle e_k\rangle$ in the exponent is introduced to avoid numerical problems. We
set to zero all weights for which $\exp[e_k - \langle e_k\rangle]$ < 10^"*^^.
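The numerical recipe of Eqs. (37) and (38) might be sketched as follows. The function name is hypothetical, and the cut-off `floor` is left as a free parameter of the example (its default of zero applies no cut); both are assumptions rather than part of the procedure described above. The mean of $e_k$ is subtracted as in the text, although subtracting the maximum instead would give the same normalised weights.

```python
import numpy as np

def reweighting_weights(chi2_k, n_dat, floor=0.0):
    """Weights of Eq. (9), evaluated in the shifted-exponent form of Eqs. (37)-(38)."""
    chi2_k = np.asarray(chi2_k, dtype=float)
    n_rep = len(chi2_k)
    e = 0.5 * ((n_dat - 1) * np.log(chi2_k) - chi2_k)   # Eq. (37)
    u = np.exp(e - e.mean())                            # shift by <e_k> before exponentiating
    u[u < floor] = 0.0                                  # optional cut on tiny weights (assumed)
    return n_rep * u / u.sum()                          # Eq. (38): weights sum to N_rep

# Identical chi^2 values give unit weights.
w_equal = reweighting_weights([100.0, 100.0, 100.0, 100.0], n_dat=50)
# For a single data point the weights reduce to exp(-chi^2 / 2), up to normalisation.
w_single = reweighting_weights([1.0, 3.0], n_dat=1)
# Toy chi^2 values (hypothetical) for a dataset of 186 points.
w_toy = reweighting_weights([150.0, 186.0, 220.0, 400.0], n_dat=186)
```

Because the common factor cancels in the ratio of Eq. (38), the shift changes only the intermediate numbers and not the final weights.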

The $\chi_k^2$ distributions for the jet data before and after reweighting, the $\mathcal{P}(\alpha)$ estimator
and the distribution of weights are shown in Fig. 1. We notice that before reweighting the
distribution of $\chi^2$ per data point is peaked close to one, but with a long tail extending to
higher values of $\chi^2$. This is to be expected since the inclusive jet data are not included
in the NNPDF2.0(DIS+DYP) set. However 82% of the replicas have $0.5 < \chi_k^2/N_{\rm dat} < 2$,
confirming that the inclusive jet data are likely to be consistent with the other data in the
fit and their inclusion in the fit will have only a moderate impact. Indeed a significant
fraction of the weights are of order one, with however a long tail of small weights for
replicas which will be effectively eliminated once the inclusive jet data are included.

To make these statements more quantitative, we can now evaluate the number of
effective replicas, determined through the Shannon entropy according to Eq. (10): the
effective number of replicas after reweighting using the jet data is $N_{\rm eff} = 332$, i.e. around
a third of the replicas are left.

To examine the consistency of the inclusive jet data with the DIS and Drell-Yan data
used in NNPDF2.0(DIS+DYP), we show in Fig. 1 the reweighted $\chi^2$ distribution computed
according to Eq. (11). Clearly the replicas which gave a poor fit to the jet data have now
been removed, and the result is a distribution of $\chi^2$ per data point peaked at one, which shows that the
jet data are consistent with the DIS and Drell-Yan data. This conclusion is reinforced by
the probability distribution $\mathcal{P}(\alpha)$, plotted in Fig. 1: the most probable value for $\alpha$ is close
to one, showing that the overall size of the experimental errors of these data has been
well estimated by CDF and D0.

Figure 1: Upper plots: distribution of $\chi_k^2/N_{\rm dat}$ (the $\chi^2$ per data point) and the weights $w_k$ in the
reweighting of the NNPDF2.0(DIS+DYP) set using the inclusive jet data. Lower plots: the
reweighted $\chi^2$ distribution of the inclusive jet data, and the probability distribution $\mathcal{P}(\alpha)$
of the error rescaling parameter $\alpha$.

In order to determine quantitatively if indeed the refitted and reweighted PDF sets
represent statistically identical distributions, we can compute the distances between central
values and uncertainties of different PDF combinations, as defined in Ref. with the
required modifications to account for the individual weights of each replica.^4 In Fig. 2 we
plot the distances between PDFs' central values and uncertainties for the reweighted set
and the (refitted) NNPDF2.0 set. Note that distances have been computed between sets
of $N_{\rm rep} = 100$ replicas. The distance is normalised such that distances $d \sim 1$ correspond
to statistically identical distributions. We see that to a very good approximation the
refitted and the reweighted sets are statistically equivalent, both for central values and
uncertainties. The very largest distances, $d \sim 2$, correspond to a difference of about one
seventh of a standard deviation of the measured quantity.

Given that, as shown in Fig. 2, the refitted and reweighted sets are statistically equivalent,
we know from [3] that inclusive jet data constrain only the large-$x$ gluon, leaving virtually
unchanged all other distributions. The reweighted gluon distribution and its uncertainty
are shown in Fig. 3, compared with the original distribution, the NNPDF20(DIS+DYP)
fit, and with the full NNPDF2.0 fit. On the left hand side we plot the gluon distribution
with its uncertainty band and on the right hand side the absolute value of the uncertainty.



^1 The expressions for the distances for reweighted PDF sets are collected in Appendix A.

Figure 2: Distances between PDFs (above) and uncertainties (below) for the NNPDF2.0 set and
a set obtained adding the Tevatron inclusive jet production data to the NNPDF2.0(DIS+DY) fit
using the reweighting technique. The distances have been computed between sets of $N_{\rm rep} = 100$
replicas.

The reweighted and refitted distributions are indeed shown to be equivalent within errors.
In particular the error of the medium and large-$x$ gluon is significantly reduced by the
inclusion of the Tevatron inclusive jet data, while the other PDFs are essentially unchanged in
both the refitted and the reweighted sets.

This statistical equivalence is an important check on the consistency of the NNPDF 
fitting methodology and the reweighting method presented here. In particular, it shows 
that an NNPDF parton fit (at least in the case examined here) behaves in a way which is 
consistent with the laws of statistical inference: since reweighting is simply an application 
of probability theory, and since reweighting and refitting can be used interchangeably, the 
results obtained from the global fits indeed behave as probability distributions. 




Figure 3: The gluon distribution (left) and its uncertainty (right) of the NNPDF2.0(DIS+DY)
fit before and after reweighting with the inclusive jet data compared to the refitted gluon from
NNPDF2.0 on a linear scale.



5 Application: the W lepton asymmetry 



Now that we have explicitly verified that reweighting works, we can use it to assess the
impact on PDF determination of data which were not included in the NNPDF2.0 fit. In
this section we consider the Tevatron D0 W lepton charge asymmetry high luminosity
data from Run II [8,9]. These data have attracted a lot of interest recently because of their
potential inconsistency with other datasets which are traditionally included in the global
fit, like the deuterium DIS data [11,12].



5.1 Motivation 



In proton-antiproton scattering, $W^+$ ($W^-$) bosons are mainly produced by the annihilation of a
$u$ ($d$) quark in the proton with the $\bar{d}$ ($\bar{u}$) in the anti-proton. An asymmetry in the $W^+$ and
$W^-$ rapidity distributions is the result of a difference between the $u$ and $d$ distributions in
the proton. Therefore, the information on the W charge asymmetry [19] provides a further
constraint on the $u$ and $d$ PDFs. However, due to the unknown longitudinal momentum of
the neutrino, the vector boson rapidity is difficult to determine directly. What is typically
measured [7-10] is instead the lepton charge asymmetry. The vector boson rapidity may
then be deduced in terms of the pseudo-rapidity of the charged lepton and its transverse
energy $E_T$. Moreover, if the transverse energy $E_T$ of the outgoing lepton is relatively small,
the leading sea contribution $\bar{u}$-$\bar{d}$ is enhanced relative to the valence-valence contributions,
so the lepton charge asymmetry also probes the separation into valence and sea quarks.
For this reason in some experimental analyses [9,10], the lepton asymmetry is measured
in different bins of $E_T$.



Figure 4: The $d/u$ ratio at large $x$ computed at $Q_0^2 = 2$ GeV$^2$ from the NNPDF2.0, MSTW08 and
CT10 sets. We show the results for the ratio normalized to NNPDF2.0 (left plot) and the relative
PDF uncertainties in each case (right plot). All uncertainties are $1\sigma$.

Historically, the Tevatron W lepton charge asymmetry data have been used in global
fits together with the deuterium DIS data from BCDMS and NMC to constrain the ratio
of $d$ to $u$ quarks at large $x$. One advantage with respect to DIS data is that theoretical
uncertainties linked to the deuterium target, like nuclear effects, are not present for the
lepton asymmetries, where only proton PDFs are involved. In Fig. 4 we show the $d/u$ ratio
computed using different recent PDF sets: NNPDF2.0, CT10 and MSTW08, together
with the relative uncertainties. It is clear that PDF uncertainties are sizable for this
combination at large $x$, thus additional precision measurements of the W asymmetry are
useful to reduce PDF errors in this region. We notice that in the kinematic region probed

Figure 5: Predictions for the D0 W lepton charge asymmetry obtained with the DYNNLO code
at next-to-leading order, using the NNPDF2.0 [4], CT10 and MSTW08 parton sets. We
show results for the muon charge asymmetry (top left), and the electron charge asymmetry in the
inclusive bin, $E_T^e > 25$ GeV, bin A (top right), and then in less inclusive bins, 25 GeV $< E_T^e <$ 35
GeV, bin B (bottom left), and $E_T^e > 35$ GeV, bin C (bottom right).



by the Tevatron measurements ($0.1 < x < 0.7$) the predictions from the three sets are in
reasonable agreement within the respective uncertainties.

In the NNPDF2.0 analysis only the CDF W boson direct asymmetry data of Ref. [19]
are included. This observable is implemented in the fitting code at next-to-leading order,
without reverting to a K-factor approximation, using the FastKernel method described
in [4]. The W lepton asymmetry measurements, on the other hand, were not included in
the analysis due to the lack of a fast implementation. However, the recent development
of the APPLGRID [20] interface is likely to facilitate the future inclusion of these data
directly in our fits. Thanks to the reweighting technique presented earlier in this paper,
we can now study the impact of the lepton asymmetry data consistently in NLO QCD.

Here we will consider the electron and muon asymmetry measurements performed
by the D0 collaboration at Run II of the Tevatron and published in Refs. [8,9]. The
more recent D0 muon analysis of Ref. [10] has not been included since the data are still
preliminary. The datasets included in our analysis are the same as those included in the
dedicated CT10W analysis [11]. The lepton asymmetry measurements from CDF [7] are
not considered here since the direct CDF W asymmetry data are already included in the
NNPDF2.0 fit.

Let us discuss in more detail the lepton asymmetry data that we consider. In Ref. [8]
a measurement of the muon charge asymmetry based on 0.3 fb$^{-1}$ of data is presented.
The asymmetry measurement is binned in ten bins in the muon pseudo-rapidity in the
range $|\eta_\mu| < 2$, and cuts are imposed on the transverse energy and mass of the muon:



Set                                                     $N_{\rm dat}$   NNPDF2.0   MSTW08   CT10
D0 muon, $E_T^\mu > 20$ GeV [8]                          10       0.62       1.51     0.70
D0 electron, $E_T^e > 25$ GeV (bin A) [9]                12       2.12       9.20     4.07
D0 electron, 25 GeV $< E_T^e <$ 35 GeV (bin B) [9]       12       4.75       1.66     9.48
D0 electron, $E_T^e > 35$ GeV (bin C) [9]                12       5.06       13.4     11.7

Table 1: The D0 W lepton charge asymmetry datasets that are included in the present analysis,
together with the $\chi^2$ per data point obtained from the NLO predictions of various PDF sets. The
electron data of Ref. [9] are divided into three bins that we denote by bin A, bin B and bin C.

$E_T^\mu > 20$ GeV and $M_T > 40$ GeV. In Ref. [9] a similar measurement of the electron
charge asymmetry is presented based on 0.75 fb$^{-1}$ of data. The asymmetry is binned in
twelve bins in the electron pseudo-rapidity in the range $|\eta_e| < 3.2$, and cuts are imposed
on the missing energy and transverse mass: $\not\!\!E_T > 25$ GeV and $M_T > 50$ GeV. Three
sets of measurements are then given, which have different cuts in the transverse energy
of the electron: an 'inclusive' bin which has $E_T^e > 25$ GeV (which we refer to here as
bin A), and two less inclusive bins with more restrictive cuts on the transverse energy,
25 GeV $< E_T^e <$ 35 GeV (bin B) and $E_T^e > 35$ GeV (bin C). Note that bins B and C
together cover the same kinematic range as the more inclusive bin A.

To analyse these data using the reweighting technique we use the DYNNLO code [22] to 
compute the theoretical predictions for the lepton asymmetries at NLO, using NNPDF2.0 
as input parton densities. This code is a parton level Monte Carlo program designed to 
compute exclusive hadronic processes up to NNLO, and it enables the user to implement 
the same cuts used in the experimental analyses. 

Before considering reweighting, let us first compare in Fig. 5 the predictions obtained
with DYNNLO and different PDF sets at NLO, for the various D0 datasets. It is perhaps
surprising that, even though none of these data are included in the NNPDF2.0 fit, the
prediction obtained from NNPDF2.0 is in general closer to the experimental data than
the predictions obtained with the other parton sets: the reason can be traced back to the
somewhat larger $d/u$ ratio in the range $0.2 < x < 0.6$ (Fig. 4) for the NNPDF2.0 set. The
exception is bin B, for which MSTW08 provides the best description.

The quality of the comparison of various PDF sets with the asymmetry data can be
quantified by evaluating the $\chi^2$ for each dataset. For all the Tevatron Run II D0 lepton
asymmetry data only the statistical and uncorrelated systematic errors are quoted. The
covariance matrix is therefore diagonal and its elements are given by the sum in quadrature
of the statistical and the uncorrelated systematic errors. There is no normalization
uncertainty since the asymmetry is a ratio of cross-sections.
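
The $\chi^2$ with a diagonal covariance matrix described above can be sketched as follows.
This is only an illustration: the function name and argument layout are hypothetical, and the
inputs are assumed to be plain sequences of equal length (one entry per pseudo-rapidity bin).

```python
def chi2_diagonal(data, theory, stat_err, sys_uncorr_err):
    """Chi-squared with a diagonal covariance matrix.

    Each diagonal element is the sum in quadrature of the statistical
    and the uncorrelated systematic error of that data point.
    """
    total = 0.0
    for d, t, s, u in zip(data, theory, stat_err, sys_uncorr_err):
        variance = s * s + u * u          # sum in quadrature
        total += (d - t) ** 2 / variance
    return total


def chi2_per_point(data, theory, stat_err, sys_uncorr_err):
    """Chi-squared divided by the number of data points."""
    return chi2_diagonal(data, theory, stat_err, sys_uncorr_err) / len(data)
```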

The value of the $\chi^2$ per data point and the number of data points for each set
considered in the present analysis are shown in Table 1. The results confirm the studies
performed in Ref. [23]. In particular, the less inclusive data (bins B and C) are rather
poorly described by all the current PDF fits, with the exception of bin B which MSTW08
describes reasonably well (though at the cost of a very bad fit to bins A and C). Note
however that Ref. [11] uses the RESBOS program [24] to compute the predictions for the
W lepton asymmetry. RESBOS includes, on top of the NLO result, higher-order corrections
from $p_T$ resummation. The differences between NLO and RESBOS are maximal in the
kinematics of the electron bin B data. These differences might explain, at least in part, the
values of the $\chi^2$ for CT10 obtained in Table 1 compared to those given in Ref. [11].

We now consider the effect of including the Run II D0 muon and electron asymmetry
data in the NNPDF2.0 analysis using reweighting. We will consider each dataset in turn,




Figure 6: Distribution of the $\chi^2_k$, the weights $w_k$, the reweighted $\chi^2$-distribution and the
probability distribution $\mathcal{P}(\alpha)$ in the reweighting of the NNPDF2.0 PDF set using the D0 muon
asymmetry data [8].

concentrating first on the inclusive sets (muon and electron bin A), and turning later to
the less inclusive datasets (bins B and C). For each case we will proceed as follows: first
we provide the distribution of $\chi^2_k$ before and after reweighting, the probability distribution
$\mathcal{P}(\alpha)$ and the distribution of weights. We then compare the reweighted PDFs to
experimental data. Finally we compute the distances between the original and the reweighted
sets, and compare the corresponding PDFs where they differ substantially from the original
ones.

Unless otherwise stated, PDFs and their uncertainties will be plotted at the scale
$Q^2 = Q_0^2 = 2$ GeV$^2$. The $N_{\rm rep} = 1000$ NNPDF2.0 set is used throughout, with the
exception of the computation of distances, where we instead use sets of 100 replicas.

5.2 Inclusive data 

Let us first consider the inclusion of the D0 muon charge asymmetry data [8]. The
distribution of the $\chi^2_k$ and the corresponding weights for these data is shown in the upper plots
of Fig. 6. Since the $\chi^2$-distribution is peaked close to one, the weights are also mostly of
order unity. The reweighted $\chi^2$ and probability density for the rescaling parameter $\alpha$ are
shown in the lower plots: they peak a little below one, suggesting that the errors on these
data are actually likely to have been overestimated by D0. After reweighting the $\chi^2$ per
data point drops from 0.62 to 0.51, and the number of effective replicas is $N_{\rm eff} = 795$.
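
To make the relation between the $\chi^2_k$ and the weights concrete, here is a minimal sketch.
It assumes that the weight of replica $k$ is proportional to
$(\chi^2_k)^{(n-1)/2} e^{-\chi^2_k/2}$, where $\chi^2_k$ is the total (not per-point) $\chi^2$ of
that replica to the $n$ new data points, and that weights are normalised to sum to the number
of replicas. The function name is hypothetical, and the computation is done in log space only
to avoid numerical underflow.

```python
import math


def reweight(chi2_values, n_dat):
    """Replica weights from total chi-squared values (all assumed > 0).

    Assumed form: w_k proportional to chi2_k**((n_dat - 1) / 2) * exp(-chi2_k / 2),
    normalised so that the weights sum to the number of replicas.
    """
    logs = [0.5 * (n_dat - 1) * math.log(c) - 0.5 * c for c in chi2_values]
    shift = max(logs)                      # subtract the largest log-weight
    raw = [math.exp(value - shift) for value in logs]
    norm = len(raw) / sum(raw)
    return [r * norm for r in raw]
```

Under this assumed form the weight is largest for replicas with $\chi^2_k = n - 1$, so a
distribution of $\chi^2_k/n$ peaked near one gives weights mostly of order unity.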

On the left in Fig. 7 we show the muon charge asymmetry before and after the reweighting.
Indeed the predictions get closer to the data, once the PDFs are reweighted. We have
also examined the effect on the shape of the PDFs, but the effects are negligible apart
from a slight reduction in the uncertainty of the total valence distribution, shown in Fig. 8.



Figure 7: Left: W muon charge asymmetry computed for the NNPDF2.0 PDFs before and after 
the reweighting of these data into the parton analysis. Right: W electron charge asymmetry 
(inclusive bin) computed with the NNPDF2.0 PDFs before and after the reweighting of these data 
in the parton analysis. 

Figure 8: Total valence PDF for the NNPDF2.0 and NNPDF2.0 + D0 muon data PDF sets.

Figure 9: Distances between NNPDF2.0 and NNPDF2.0 + D0 W lepton asymmetry measurements
from the muon dataset. The NNPDF2.0 set with $N_{\rm rep} = 100$ has been used in the computation of
the distances.

This is confirmed by the distance analysis, Fig. 9, which shows that central values and
uncertainties for all PDFs are essentially unchanged, with the exception of the total valence
PDF where the inclusion of muon data has a moderate effect.



Set          $\chi^2_{2.0}$   $\chi^2_{2.0+\mu}$   $\chi^2_{2.0+{\rm binA}}$   $\chi^2_{2.0+{\rm binA}+\mu}$   $\chi^2_{2.0+{\rm binB}}$   $\chi^2_{2.0+{\rm binC}}$
NMC-pd       0.99   0.98   0.98   0.98   0.97   1.13
NMC          1.72   1.72   1.69   1.70   1.72   1.72
SLACp        1.55   1.55   1.53   1.54   1.50   1.63
SLACd        1.12   1.12   1.07   1.09   1.05   1.24
BCDMSp       1.35   1.35   1.33   1.34   1.41   1.35
BCDMSd       1.16   1.16   1.16   1.16   1.24   1.14
HERA1-NCep   1.35   1.35   1.34   1.34   1.33   1.35
HERA1-NCem   0.86   0.86   0.86   0.86   0.86   0.86
HERA1-CCep   0.96   0.96   0.94   0.94   1.02   0.92
HERA1-CCem   0.56   0.56   0.56   0.56   0.56   0.57
CHORUSnu     1.08   1.08   1.08   1.08   1.11   1.10
CHORUSnb     0.86   0.86   0.86   0.86   0.87   0.90
FLH108       1.50   1.50   1.50   1.50   1.47   1.50
NTVnuDMN     0.69   0.66   0.67   0.65   0.82   0.60
NTVnbDMN     0.70   0.70   0.69   0.69   0.72   0.81
Z06NC        1.24   1.24   1.24   1.24   1.23   1.26
Z06CC        1.19   1.19   1.19   1.19   1.15   1.21
DYE605       0.86   0.86   0.84   0.85   0.87   0.85
DYE886p      1.31   1.32   1.29   1.30   1.28   1.36
DYE886r      0.83   0.79   0.67   0.71   1.08   0.72
CDFWASY      1.88   1.88   1.78   1.82   2.05   1.60
CDFZRAP      1.74   1.77   1.75   1.77   1.37   1.97
D0ZRAP       0.59   0.59   0.59   0.59   0.60   0.61
CDFR2KT      1.02   1.02   0.95   0.97   1.21   0.93
D0R2CON      0.86   0.86   0.84   0.84   0.91   0.84
TOTAL        1.14   1.14   1.13   1.13   1.16   1.16

Table 2: $\chi^2$ per data point of all the experiments included in the NNPDF2.0 fit, evaluated before
and after reweighting with the various lepton asymmetry datasets. Note that here we use the $t_0$
covariance matrix in the evaluation of the $\chi^2$: the numbers are thus slightly different from those
shown in Ref. [4]. The cases in which the $\chi^2$ varies significantly as compared to the reference are
highlighted in boldface.

To study the compatibility of these data with the data included in the NNPDF2.0
analysis, in Tab. 2 we show the $\chi^2$ of each of the datasets included in the NNPDF2.0
analysis evaluated with the original NNPDF2.0 PDFs and then with these PDFs reweighted
by the inclusion of the D0 muon asymmetry data. If anything, there is a slight improvement
in the description of most of the datasets. To summarize, the D0 muon asymmetry
data [8] are perfectly consistent with NNPDF2.0, but are not sufficiently precise to add
much information to the PDFs. It will be interesting to assess the impact of the higher
statistics D0 Run II muon dataset [10] once the analysis is completed.

Next we consider the inclusive D0 electron data (bin A) with $E_T^e > 25$ GeV. The results
are shown in Fig. 10. Once included in the fit through reweighting, the $\chi^2$ for this set drops
from 2.12 to 1.55. While the distribution of the unweighted $\chi^2_k$ is peaked above two and
has a long tail to higher values, after reweighting the peak is shifted much closer to one.
This is achieved through a substantial reduction in the effective number of replicas: after
reweighting $N_{\rm eff} = 262$. However, while before reweighting only 16% of replicas lie in the
region $1/2 < \chi^2 < 2$, after reweighting this figure rises to 78%. This behaviour is confirmed
by the plot of $\mathcal{P}(\alpha)$: the data indicate that the most probable value of $\alpha$ is around 1.6,
indicating that experimental errors on these data are underestimated. Taken together,
this shows that these data might have a significant effect on constraining the PDFs, while
still being broadly consistent with all the other data included in the fit.
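
The most probable value of the rescaling parameter can be located numerically. The sketch
below assumes that the rescaling density has the form
$\mathcal{P}(\alpha) \propto \frac{1}{\alpha}\sum_k w_k(\alpha)$, where $w_k(\alpha)$ is the
(unnormalised) replica weight with $\chi^2_k$ replaced by $\chi^2_k/\alpha^2$ and the weight is
taken proportional to $(\chi^2_k)^{(n-1)/2} e^{-\chi^2_k/2}$. Both forms are assumptions of this
illustration, as are the function names and the simple grid scan.

```python
import math


def rescaling_density(alpha, chi2_values, n_dat):
    """Unnormalised density for the error rescaling parameter alpha.

    Assumed form: (1 / alpha) * sum_k w_k(alpha), where w_k(alpha) uses
    chi2_k / alpha**2 in place of the total chi-squared chi2_k (> 0).
    """
    total = 0.0
    for c in chi2_values:
        scaled = c / alpha ** 2
        total += math.exp(0.5 * (n_dat - 1) * math.log(scaled) - 0.5 * scaled)
    return total / alpha


def most_probable_alpha(chi2_values, n_dat, grid):
    """Grid point at which the assumed density is largest."""
    return max(grid, key=lambda a: rescaling_density(a, chi2_values, n_dat))
```

For a single replica the assumed density peaks at $\alpha^2 = \chi^2/n$, which makes the
interpretation transparent: a peak above one signals that the quoted errors would need to be
enlarged for the data to be described with $\chi^2$ per point of order one.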

Figure 10: Distribution of the $\chi^2_k$, the weights $w_k$, the reweighted $\chi^2$-distribution and the
probability distribution $\mathcal{P}(\alpha)$ in the reweighting of the NNPDF2.0 PDF set using the D0 electron
asymmetry data (bin A).



The improvement in the description of the electron asymmetry after reweighting is apparent in
Fig. 7, while the fit to the other datasets included in the NNPDF2.0 fit shows no significant
deterioration: if anything, there is a slight improvement, in particular in the
fit to the CDF W asymmetry data (see Tab. 2).

In Fig. 12 we plot the distances between the prior set NNPDF2.0 and the reweighted
set: it is clear that the most significant effect is on the uncertainty in the valence PDF.
Indeed, in Fig. 11 we show the error reduction that comes from the inclusion of the inclusive
D0 electron charge asymmetry data on the valence PDF. While the central value remains
essentially unchanged, the uncertainty is significantly reduced. Small improvements in the
precision of the singlet and triplet quark distributions can also be observed, while other
PDF combinations remain unchanged.
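
The central value and uncertainty of any quantity in a reweighted set are weighted averages
over the replicas. A minimal sketch, assuming one value of the quantity per replica (for
example a PDF at fixed $x$) and one weight per replica; the function name is hypothetical and
the weights need not be normalised beforehand:

```python
import math


def weighted_mean_std(values, weights):
    """Weighted central value and standard deviation over replicas."""
    total = sum(weights)
    mean = sum(w * v for v, w in zip(values, weights)) / total
    variance = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    return mean, math.sqrt(variance)
```

With all weights equal this reduces to the usual replica mean and standard deviation; an
unchanged mean together with a smaller standard deviation corresponds to the pattern
described above for the valence distribution.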

Having found that both the inclusive muon and electron (bin A) D0 asymmetry data
are each consistent with the datasets used in NNPDF2.0, it is interesting to ask whether
they are also consistent with each other. This is not obvious a priori: it is in principle
possible for each dataset to prefer a different subset of the NNPDF2.0 replicas.

To examine this question we performed a reweighting analysis with the combined
dataset: the $\chi^2_k$ used to determine the weights are then the sum of those from the D0
muon asymmetry data and the D0 electron asymmetry data, i.e. $N_{\rm dat} = 22$ data points.
The number of effective replicas is then reduced to $N_{\rm eff} = 356$, actually a number larger
than in the case where the electron data alone were considered: the muon data soften the impact
of these data. The combined $\chi^2$ and $\alpha$ distributions (see Fig. 13) are now better behaved:
while before the reweighting only 49% of replicas have $1/2 < \chi^2 < 2$, after reweighting this
now rises to 99%. The peak of the $\alpha$ distribution is now closer to one: the overestimated
uncertainties of the muon data in part compensate for the underestimated uncertainties of



Figure 11: Total valence PDF for NNPDF2.0 and NNPDF2.0 + D0 electron data (bin A).

Figure 12: Distances between NNPDF2.0 and NNPDF2.0 + D0 W lepton charge asymmetry
measurements from the electron bin A dataset. The NNPDF2.0 set with $N_{\rm rep} = 100$ has been used
in the computation of the distances.



the electron data. The quality of the fit to the other datasets included in the NNPDF2.0
fit shows no significant deterioration, and again there is a slight improvement in the fit
to the CDF W asymmetry data (see Tab. 2).

In Fig. 14 we show the effect of the addition of the D0 muon and the D0 electron
inclusive data on the valence distribution. The precision of the valence distribution is
significantly improved, though without shifting its central value significantly. This implies
that the NNPDF2.0 set is quite consistent with the inclusive data, so that their addition
entails only PDF uncertainty reduction without affecting central values. It follows that the
$d/u$ ratio extracted from the DIS deuterium data and the CDF direct W charge asymmetry
data will be consistent with the information included in the D0 inclusive muon and electron
data.

The main statistical estimators for the lepton charge asymmetry datasets are
summarized in Tab. 3. The two inclusive sets have a significant impact on PDFs, and are
reasonably consistent with themselves (though the experimental uncertainties on the
inclusive D0 electron charge asymmetry data, bin A, may be a little underestimated), with
the other data used in the NNPDF2.0 fit, in particular the CDF W charge asymmetry
data, and with each other.

The $\chi^2$ values for the total dataset and for the individual experiments in the NNPDF2.0
analysis are shown in Table 2. The sets that differ sizably from the reference results have

Figure 13: The reweighted $\chi^2$-distribution and the probability distribution $\mathcal{P}(\alpha)$ in the reweighting
of the NNPDF2.0 PDF set using the combined D0 muon asymmetry data and electron asymmetry
data (bin A).

Figure 14: Total valence PDF for the NNPDF2.0 and NNPDF2.0 + D0 muon + D0 electron data
(bin A) sets.

been highlighted in boldface in the different cases. As far as the inclusive muon and
electron datasets are concerned we notice that both are consistent with the NNPDF2.0
datasets, and their inclusion improves the fit to the W asymmetry data. Furthermore they
are consistent with each other. These findings do not support the conclusions of
the MSTW08 analysis [12], which finds that inclusion of the D0 electron inclusive bin in
the global fit, without significant deterioration in the fit to the other datasets, requires
sizable nuclear corrections to deuterium data.

5.3 More exclusive data 

We now turn to the less inclusive D0 electron charge asymmetry data (bin B and bin C),
where the transverse energy of the electron is restricted to the range 25 GeV $< E_T^e <$ 35
GeV and $E_T^e > 35$ GeV respectively.

We first consider each bin separately, and we turn then to their combination.
Considering first the lower $E_T^e$ bin (bin B), the number of effective replicas is now reduced to
$N_{\rm eff} = 61$, indicating that, as expected, these data are more constraining than those of
the inclusive bin. This is so because the data binned in $E_T^e$ probe a more localized region
in $x$ of the PDFs as compared to the inclusive data. The $\chi^2$ for this set drops from 4.75
to 1.12 after the data are included. From the plots in Fig. 15 we see that indeed there is
now a significant fraction of very small weights, because many of the replicas fit the new

Figure 15: Distribution of the $\chi^2_k$, the weights $w_k$, the reweighted $\chi^2$-distribution and the
probability distribution $\mathcal{P}(\alpha)$ in the reweighting of the NNPDF2.0 PDF set using the D0 electron
asymmetry data bin B.



data rather badly. After reweighting the $\chi^2$ distribution improves very significantly: while
before reweighting only 4.8% of replicas were in the range $1/2 < \chi^2 < 2$, after reweighting
this increases to 86.5%. However the rescaling plot peaks at around $\alpha \sim 2$, indicating that
the errors on the data are significantly underestimated.

The improvement in the fit to the lowest $E_T$ bin charge asymmetry data is manifest
on the left of Fig. 18. However the fit to some of the other datasets in the NNPDF2.0 fit,
in particular BCDMSp and BCDMSd, becomes significantly worse (see Table 2). The fact
that there is as much tension here with the proton data as with the deuteron data suggests
that it is unlikely that nuclear corrections to the deuteron target can help (contrary to
the claim in Ref. [12]). The overall $\chi^2$ per degree of freedom rises from 1.14 to 1.16: this
is rather significant, given that we are only adding 12 new data points to the 3415 used
in NNPDF2.0. The decrease in the fit quality is driven by the large weight that the BCDMS
data carry in the global fit. It should be further noted that the fit to the inclusive jet data and
CDF W asymmetry also worsens.

These problems are also apparent when we look at the effect on individual PDFs:
in particular while the valence distribution, Fig. 16, is now better determined in some
ranges of $x$, elsewhere the uncertainty increases. While this may in part be due to the
rather limited statistics of the reweighted distribution, it is probably also a sign of some
inconsistency with the other data used in the NNPDF2.0 fit. The statistical distances,
plotted in Fig. 17, are also sizable for some PDFs, especially the valence distribution.

We finally consider the remaining D0 electron asymmetry dataset at highest $E_T$ (bin
C): the results are displayed in Fig. 19. While the impact of these data is similar to that
of the lowest $E_T$ bin (bin B), with the effective number of replicas dropping to 68, the
quality of the fit to the unweighted replicas is so poor (there are no replicas with a $\chi^2$




Figure 16: Total valence PDF for NNPDF2.0 and NNPDF2.0 + D0 electron data (bin B).

Figure 17: Distances between NNPDF2.0 and NNPDF2.0 + D0 W lepton asymmetry
measurements from the electron bin B dataset. The NNPDF2.0 set with $N_{\rm rep} = 100$ has been used in the
computation of the distances.

below 2) that even after reweighting the quality of the fit is still not very good, the average
$\chi^2$ per data point dropping from 5.06 to 2.51.

The rescaling plot shows a preferred value of $\alpha \sim 2.3$, suggesting that again the
experimental errors in these data are seriously underestimated. This might be caused by
some underestimated systematic uncertainties in the separation of the data into bins of
different $E_T$. The poor quality of the fit, even after reweighting, is again apparent in
Fig. 18: it is clear that some of the bins simply cannot be fitted with a reasonably smooth
distribution. There is also tension between these data and some of the other datasets
included in NNPDF2.0 (see Table 2): in particular while BCDMS is now fine, the fit to
the NMC-pd ratio is spoiled.

When we examine the effect of these data on the PDFs, we see (Fig. 20) that rather
than making the PDFs more precise, in many regions of $x$ the uncertainty increases
substantially. The enlarging of the uncertainty is of course what one would expect when
inconsistent data are combined, and it was previously seen to occur in NNPDF parton fits
(see e.g. Sect. 3.4.1 of Ref. [25]). Here, it is shown to occur as a consequence of standard
statistical inference.

We also attempted a combined fit of the D0 electron asymmetry data bins B and C,
but the constraint imposed on PDFs by including these data together is so severe that the
number of effective replicas is reduced to one. This shows that not only are these data

Figure 18: W electron asymmetry computed on the NNPDF2.0 set before and after the reweighting
of the D0 W electron asymmetry: bin B (on the left) and bin C (on the right).



each inconsistent with other data included in the global fit, but they are also inconsistent 
with each other. 

The main statistical estimators for the exclusive electron charge asymmetry datasets
are summarized in Tab. 3. In contrast to what we observe for the inclusive sets, these
datasets, while having an even greater effect on the PDFs, appear to be internally inconsistent
(bin C), inconsistent with other data used in NNPDF2.0, particularly BCDMS proton and
deuteron data (bin B), and also inconsistent with each other.

The results in the present study cannot be directly compared to the ones obtained
in the CT10 analysis, because there the three electron bins are added simultaneously
to the fit. On top of the double counting problem, this is problematic because internal
tensions of experimental origin between the different bins might be mistaken for a physical
effect, such as nuclear corrections. Indeed, we have shown that the more exclusive datasets
are not only inconsistent with other sets in the global analysis but also inconsistent among
themselves, so that it is probably not a good idea to include both in the fit simultaneously.

5.4 Implications for the d/u ratio, and LHC benchmarks 

Up to now we have considered the impact of the D0 data on different PDF combinations,
noticing that the most relevant effect was on the total valence distribution. To conclude
our analysis we assess the impact of W lepton charge asymmetry data on the $d/u$ ratio.
In Fig. 21 we display the $d/u$ ratio at large $x$ computed at $Q_0^2 = 2$ GeV$^2$ from the original
NNPDF2.0 set and the four sets obtained through reweighting of NNPDF2.0 with the D0

Table 3: A summary of the results of reweighting with the D0 W lepton asymmetry data: the
fraction $N_{\rm eff}/1000$ of replicas left after reweighting, the most probable value $\alpha_{\rm opt}$ of the error
rescaling parameter $\alpha$, the $\chi^2$ per data point to the D0 W lepton data evaluated before and after
the reweighting, and the total $\chi^2$ per data point to all the other data in the NNPDF2.0 fit.

Figure 19: Distribution of the $\chi^2_k$, the weights $w_k$, the reweighted $\chi^2$-distribution and the
probability distribution $\mathcal{P}(\alpha)$ in the reweighting of the NNPDF2.0 PDF set using the D0 electron
asymmetry data bin C [9].



lepton asymmetry data. The effect of the inclusive datasets (muon and electron bin A) is 
rather small, even when they are combined together. The less inclusive sets (bin B and 
bin C) have a rather larger effect, but pull in opposite directions. Even so, the effect is 
only of the same order as the PDF uncertainty. 

In Fig. 22 we compare the $d/u$ ratio obtained with NNPDF2.0 and with NNPDF2.0
reweighted by the maximally consistent combination of the D0 data (muons and electrons
bin A) with the CT10 and CT10W results, normalized to NNPDF2.0. It can be seen that
the combination of D0 muon and electron bin A data leads to a substantial error reduction
of $\sim 25\%$ in the $d/u$ ratio in the $0.1 < x < 0.5$ region, with almost no change in the central
value. Note also that the $d/u$ ratio obtained from the NNPDF2.0 + D0($e$ bin A + $\mu$) set is




Figure 20: Total valence for NNPDF2.0 and NNPDF2.0 + DO electron data (bin C). 
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Figure 21: The d/u ratio at large x computed at = 2 GeV'^ froni the original NNPDF2.0 sets 
and the various sets obtained through reweighting of NNPDF2.0. We show the results for the ratio 
normalized to NNPDF2.0 (left plot) and the relative PDF uncertainties in each case (right plot). 
All uncertainties are Icr. 
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Figure 22: Lower plots: The same d/u ratio from the original NNPDF2.0 sets, the NNPDF2.0 set 
reweighted by the maximally consistent combination of the DO lepton asymmetry data (the muon 
data plus the inclusive electron data) and the CTIO and CTIOW sets. 

rather more precise than that from CTIOW, despite the fact that they include all the DO 
lepton datasets, with larger weights than the other datasets in the global analysis. 

Finally, it is interesting to ask to what extent the inclusion of the W lepton charge 
asymmetry data through reweighting affects the determination of some of the LHC standard 
cross-sections. Results for vector boson, Higgs and top pair production at √s = 7 TeV are 
collected in Table 4. They have been computed by using MCFM [26-28] to determine the 
cross-section for each replica, and then evaluating the weighted average of the results using 
Eq. (2). The uncertainties in each case are purely PDF uncertainties obtained from a 
reweighted evaluation of the variance of the cross-section. Clearly all these cross-sections 
are by and large insensitive to the addition of the D0 lepton charge asymmetry data, 
even those data (bins B and C) which show inconsistencies with the global dataset and 
thus have the largest (though least reliable) effect. This is to be expected since the LHC 
observables we have considered are not directly sensitive to large-x quarks, for which the 
impact of the D0 data is the largest. 

Table 4: Cross sections for different Standard Candle processes at the LHC (7 TeV) computed 
using NNPDF2.0 reweighted PDFs including D0 W lepton asymmetry data. The Higgs 
cross-section is computed for m_H = 120 GeV. 
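
The reweighted average and variance used for the cross-sections above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name and the toy numbers are hypothetical, the weights are assumed to be normalized so that they sum to the number of replicas, and the simple weighted variance shown here is one possible estimator, not necessarily the exact one used for Table 4.

```python
import math


def reweighted_mean_and_std(values, weights):
    """Weighted mean and standard deviation of an observable over replicas.

    values  : one prediction per PDF replica (e.g. a cross-section)
    weights : one non-negative weight per replica
    The weights are rescaled internally so that they sum to the number
    of replicas (an assumed normalization convention).
    """
    if len(values) != len(weights) or not values:
        raise ValueError("need one weight per replica and at least one replica")
    n_rep = len(values)
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    w = [n_rep * wk / total for wk in weights]
    mean = sum(wk * v for wk, v in zip(w, values)) / n_rep
    variance = sum(wk * (v - mean) ** 2 for wk, v in zip(w, values)) / n_rep
    return mean, math.sqrt(variance)


if __name__ == "__main__":
    # Toy per-replica cross-sections and weights, purely illustrative.
    sigma = [10.0, 10.4, 9.8, 10.2]
    w = [1.6, 0.4, 1.2, 0.8]
    print(reweighted_mean_and_std(sigma, w))
```

With all weights equal the function reduces to the ordinary replica average, which is the behaviour expected of the unweighted ensemble.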



6 Conclusions 



In this paper we have developed a method for determining the effect of new data on PDFs 
without the need for a global refitting. The method relies on the existence of an ensemble 
of PDFs, distributed according to the uncertainties in a global set of older data, and thus 
representing the prior probability distribution of the PDFs. Such ensembles are provided 
by the NNPDF collaboration. The effect of new data is then accounted for by reweighting 
the PDF replicas in the ensemble according to their relative probabilities given the new 
dataset. These probabilities are determined simply and easily by computing the χ² of the 
new data to the prediction obtained using a given replica. 
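
As a rough illustration of this step, the sketch below turns per-replica χ² values into weights. It assumes that the weight formula derived earlier in the paper has the form w_k ∝ (χ²_k)^((n-1)/2) exp(-χ²_k/2), with n the number of new data points and χ²_k the total (not per-point) χ² of replica k; that form is not restated in this section, so it should be read as an assumption. The function name is hypothetical.

```python
import math


def chi2_to_weights(chi2_values, n_data):
    """Convert per-replica total chi2 values into normalized weights.

    Assumed form: w_k proportional to chi2_k**((n_data - 1) / 2) * exp(-chi2_k / 2),
    normalized so that the weights sum to the number of replicas.
    The computation is done in log space to avoid overflow or underflow
    when chi2 is large.
    """
    if n_data < 1:
        raise ValueError("n_data must be at least 1")
    log_w = []
    for chi2 in chi2_values:
        if chi2 <= 0:
            raise ValueError("chi2 values must be positive")
        log_w.append(0.5 * (n_data - 1) * math.log(chi2) - 0.5 * chi2)
    shift = max(log_w)  # subtract the maximum before exponentiating
    unnormalized = [math.exp(lw - shift) for lw in log_w]
    total = sum(unnormalized)
    n_rep = len(unnormalized)
    return [n_rep * u / total for u in unnormalized]


if __name__ == "__main__":
    # Toy chi2 values for three replicas and 11 data points.
    print(chi2_to_weights([5.0, 10.0, 20.0], n_data=11))
```

Under this assumed form the weight is largest for replicas whose χ² is close to n - 1, so replicas that fit the new data either much worse or implausibly well are both suppressed.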

We have provided a careful derivation of the formula used to determine the weights. 
This is important because our result differs from that obtained in a previous attempt to 
use a reweighting method [5]. The derivation is subtle because it is necessary to deal 
with multi-dimensional probability densities, where unless one is careful one can fall into 
inconsistencies due to the Borel-Kolmogorov paradox [16]. 

The main advantage of the new method is clear: computing the weights is no more 
difficult or computer intensive than the usual procedure of preparing a plot comparing 
the new dataset with predictions from given PDFs. However, the information provided is 
much more substantial: one can assess quantitatively the impact of the new data on the 
PDFs, whether the new data are consistent with all the older data encoded within the 
PDF ensemble and the theoretical assumptions on which it was based, and then whether 
the new data have any effect on other observables of interest such as benchmark cross- 
sections. Only when the impact of the new data is very large does a full refitting of the 
PDF ensemble become necessary, due to the loss of efficiency in the reweighted ensemble. 
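
The loss of efficiency mentioned here can be monitored through the effective number of replicas. The sketch below assumes the Shannon-entropy definition N_eff = exp[(1/N) Σ_k w_k ln(N/w_k)] for weights normalized to sum to N; this section only refers to that definition, so its exact form is an assumption, and the function name is hypothetical.

```python
import math


def shannon_effective_replicas(weights):
    """Effective number of replicas from the Shannon entropy of the weights.

    Assumed definition: N_eff = exp( (1/N) * sum_k w_k * ln(N / w_k) ),
    with the weights rescaled to sum to N. Replicas with zero weight
    contribute nothing to the sum.
    """
    n_rep = len(weights)
    total = sum(weights)
    if n_rep == 0 or total <= 0:
        raise ValueError("need at least one replica with positive weight")
    w = [n_rep * wk / total for wk in weights]
    entropy = sum(wk * math.log(n_rep / wk) for wk in w if wk > 0) / n_rep
    return math.exp(entropy)


if __name__ == "__main__":
    # Toy weights: half of the ensemble carries all of the weight.
    print(shannon_effective_replicas([2.0, 2.0, 0.0, 0.0]))
```

A value of N_eff much smaller than the original number of replicas signals that the new data are very constraining (or inconsistent with the prior), which is the situation in which a refit or a larger prior ensemble would be preferable.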

We thus envisage our method being useful to experimentalists in all sorts of situations: 
testing the reliability of preliminary datasets and their uncertainties, assessing the credi- 
bility of possible indications of new physics, or in optimizing the design of new experiments 
using pseudodata. 

We have shown explicitly that the method works by considering the addition of Teva- 
tron inclusive jet data to a prior parton fit using only DIS and DY data. We have seen 
that when reweighted by the inclusive jet data, this fit becomes statistically equivalent 
to a refitting using all the data. The statistical equivalence has been quantified using 
the distance between prior and reweighted sets. This confirms that the refitted and the 
reweighted PDF sets can be seen as two samples of the same underlying probability dis- 
tribution. This is simultaneously a validation of the reweighting methodology, and an 
important a posteriori consistency check of the fitting procedure: an explicit confirmation 
that reweighting is equivalent to refitting for all data included in the global fit would 
amount to a proof that the fitted result is indeed that dictated by the laws of statistical 
inference. 

Using the reweighting formalism we have determined the impact of recent high luminosity 
D0 Run II lepton asymmetry data on the NNPDF2.0 PDFs. The lepton asymmetry 
data have historically been an important constraint on the large-x d/u ratio, but recent 
attempts [11, 12] to include the new D0 data into global fits have been problematic. We 
find instead that the data which are inclusive in E_T, the muon asymmetry data [8] and 
electron asymmetry data [9] with E_T > 25 GeV, are fully consistent with the NNPDF2.0 
predictions and have a moderate impact on PDFs, showing up as a modest though 
noticeable reduction in the uncertainty of the valence quark distribution. Moreover they 
are consistent with each other and with all the other datasets included in NNPDF2.0. 

The consistency of these data has also been studied recently by the MSTW and CTEQ 
collaborations. In particular MSTW [12] finds that it is not possible to fit the inclusive 
D0 electron dataset without affecting the description of the rest of the experiments in the 
global analysis unless large nuclear corrections for the DIS deuteron data are applied at 
the same time. The CT10 analysis [11] also suggests a sizable tension between the D0 
lepton asymmetry data and the DIS deuteron data. Our results do not support these 
conclusions. Since the predictions for the lepton asymmetry depend strongly on the d/u 
slope, it is possible that the origin of the problems in the CT10 and MSTW analyses is 
that they are based on refitting using a fixed parametrization, and are thus subject to the 
functional biases such a procedure necessarily entails. 

We further find that the less inclusive electron asymmetry data [9] binned in E_T, the 
two datasets with 25 GeV < E_T < 35 GeV and E_T > 35 GeV, while having potentially 
more impact on the PDFs, are problematic: the former data set is inconsistent with some 
of the DIS data (specifically BCDMS, both proton and deuteron), while the latter seems 
to have problems of internal consistency.* Consequently the effect on PDFs of including 
these datasets is to actually increase uncertainties in some regions of x. Furthermore, we 
find evidence that these two datasets are also mutually inconsistent. We think it likely 
that the experimental errors on these data have been substantially underestimated. Until 
these problems are better understood, we believe that it is safer to include in the global fit 
only the inclusive datasets, which even if less constraining are more robust experimentally. 

The reweighting methodology described here should allow anybody to perform their 
own updates of NNPDF fits, to incorporate whatever new datasets they are interested 
in, by following the same procedure we used here for the specific case of the W lepton 
asymmetry. We very much hope that they will exploit this possibility. 

* Similar difficulties in fitting these datasets have been reported by MSTW [12] and CTEQ [11], though 
it is not easy to make a direct comparison since they attempt to fit all three D0 electron bins simultaneously. 



A Distances between reweighted PDFs 



Given two sets of N_rep^(1) and N_rep^(2) replicas, in general reweighted, it is possible to use the 
distance estimators defined in Appendix A of Ref. [3] to determine whether they correspond 
to different instances of the same underlying probability distribution, or whether instead 
they come from different underlying distributions. 

The discussion in Ref. [4] applies also to reweighted PDF sets with the corresponding 
modifications that we list below. For example, expectation values have to be computed 
with the associated weights. For the first, second and fourth moments of the PDFs one 
then has to use 
Note that in the above equations the unweighted expressions are trivially reproduced 
setting w_k = 1. 
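
A minimal sketch of such weighted moments is given below. It assumes the natural form in which the m-th moment is (1/N_rep) Σ_k w_k f_k^m for weights that sum to N_rep; the function name is hypothetical and the normalization is an assumption made for illustration, chosen because it reproduces the unweighted expression when all w_k = 1.

```python
def weighted_moment(values, weights, order):
    """m-th weighted moment of a PDF value over an ensemble of replicas.

    Assumed form: sum_k w_k * f_k**order / sum_k w_k, which equals
    (1/N_rep) * sum_k w_k * f_k**order when the weights sum to N_rep.
    """
    if len(values) != len(weights) or not values:
        raise ValueError("need one weight per replica and at least one replica")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return sum(w * f ** order for w, f in zip(weights, values)) / total


if __name__ == "__main__":
    # Toy PDF values at a fixed (x, Q^2) point for three replicas.
    f = [1.0, 2.0, 3.0]
    w = [1.0, 1.0, 1.0]  # unit weights reproduce the unweighted moments
    print([weighted_moment(f, w, m) for m in (1, 2, 4)])
```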

Another difference arises when computing the variance of the mean and the variance 
of the variance with weighted PDF sets. In this case, these estimators scale not with the 
total number of replicas but with an effective number of replicas after reweighting 
that reduces to N_rep in the unweighted case (note that this is not the same as the N_eff 
given by the Shannon entropy, Eq. (10)). 

The variance of the mean for reweighted sets is then given by 
while the variance of the sample variance is 
Again in the unweighted case everything reduces to the expression in Ref. [1]. 



References 

[1] The NNPDF Collaboration, L. Del Debbio et al, JHEP 03 (2007) 039, 
hep-ph/0701127. 

[2] The NNPDF Collaboration, R.D. Ball et al, Nucl. Phys. B809 (2009) 1, 
arXiv:0808.1231. 

[3] The NNPDF Collaboration, R.D. Ball et al, Nucl. Phys. B823 (2009) 195, 
arXiv:0906.1958. 

[4] The NNPDF Collaboration, R.D. Ball et al, Nucl. Phys. B838 (2010) 136, 
arXiv:1002.4407. 

[5] W.T. Giele and S. Keller, Phys. Rev. D58 (1998) 094023, hep-ph/9803393. 

[6] The NNPDF Collaboration, L. Del Debbio et al, JHEP 0503 (2005) 080, 
hep-ph/0501067. 

[7] The CDF collaboration, D.E. Acosta et al, Phys. Rev. D71 (2005) 051104, 
hep-ex/0501023. 

[8] The D0 collaboration, V.M. Abazov et al, Phys. Rev. D77 (2008) 011106, 
arXiv:0709.4254. 

[9] The D0 collaboration, V.M. Abazov et al, Phys. Rev. Lett. 101 (2008) 211801, 
arXiv:0807.3367. 

[10] M. Vesterinen, D0 Conference Note 5976-CONF (2010), arXiv:1006.0451. 
[11] H.-L. Lai et al, Phys. Rev. D82 (2010) 074024, arXiv:1007.2241. 
[12] R.S. Thorne et al, PoS (2010) 052, arXiv:1006.2753. 
[13] W.T. Giele, D. Kosower and S. Keller, hep-ph/0104052. 
[14] F. De Lorenzi, arXiv:1011.4260. 

[15] The NNPDF Collaboration, R.D. Ball et al, JHEP 1005 (2010) 075, 
arXiv:0912.2276. 

[16] E.T. Jaynes, "Probability Theory: The Logic of Science", Cambridge University 
Press, (2003), ISBN 0-521-59271-2. 

[17] The CDF collaboration, A. Abulencia et al, Phys. Rev. D75 (2007) 092006, 
hep-ex/0701051. 

[18] The D0 collaboration, V.M. Abazov et al, Phys. Rev. Lett. 101 (2008) 062001, 
arXiv:0802.2400. 

[19] The CDF collaboration, T. Aaltonen et al, Phys. Rev. Lett. 102 (2009) 181801, 
arXiv:0901.2169. 

[20] T. Carli et al, Eur. Phys. J. C66 (2010) 503, arXiv:0911.2985. 
[21] A.D. Martin et al, Eur. Phys. J. C63 (2009) 189, arXiv:0901.0002. 



[22] S. Catani et al, Phys. Rev. Lett. 103 (2009) 082001, arXiv:0903.2120. 

[23] S. Catani, G. Ferrera and M. Grazzini, JHEP 05 (2010) 006, arXiv:1002.3115. 

[24] C. Balazs and C. P. Yuan, Phys. Rev. D56 (1997) 5558, hep-ph/9704258. 

[25] M. Dittmar et al, arXiv:0901.2504. 

[26] J.M. Campbell and R.K. Ellis, Phys. Rev. D62 (2000) 114012, hep-ph/0006304. 

[27] J. Campbell and R.K. Ellis, Phys. Rev. D65 (2002) 113007, hep-ph/0202176. 

[28] MCFM, http://mcfm.fnal.gov. 



